Predicting Students’ Chance of Admission Using Beta Regression

Authors

Marcus Chery and Keiron Green

Published

November 13, 2023

Introduction

Beta regression is a statistical methodology tailored to modeling continuous bounded response variables, typically constrained to the open interval (0, 1). Introduced by Ferrari and Cribari-Neto (2004) in their seminal paper, beta regression offers an elegant framework for modeling proportions and percentages with an emphasis on interpretable model parameters. This foundational work serves as a cornerstone for subsequent research and applications.

This report delves into the application of beta regression within the context of university admission data. Our objective is to provide a thorough understanding of beta regression, its practical implementation using the “betareg” package in R, and its real-world relevance for predicting students’ chance of admission. Throughout the report, we offer practical examples, code snippets, and graphs to facilitate hands-on learning. Additionally, we address the limitations of beta regression and discuss potential future research directions. The goal of this report is to equip readers with the knowledge and skills needed to use beta regression effectively, while also fostering awareness of its boundaries and avenues for further exploration.

In beta regression, the dependent variable is assumed to follow a beta distribution, and the goal is to model its relationship with one or more independent variables. Beta regression suits data where the response lies within the open interval (0, 1), such as proportions, percentages, or rates, and it models both the mean and the dispersion of the data (Ferrari and Cribari-Neto 2004).

The beta distribution is characterized by two parameters: alpha (α) and beta (β). These parameters can be modeled as a function of predictor variables, allowing for the analysis of how independent variables influence the shape of the distribution.

Like logistic regression, beta regression often uses a link function to relate the mean of the beta distribution to the linear combination of predictor variables. Common link functions include logit, probit, and log-log. The coefficients obtained from beta regression provide insights into how changes in the predictor variables affect the distribution of the response variable, including its mean and variability.

Complementing this foundation, Cribari-Neto and Zeileis (2010) introduce the “betareg” package for R, which provides a practical implementation of beta regression models along with various supporting methods and procedures. Their paper was instrumental in bringing beta regression into routine data analysis, and it showcases the package’s utility in addressing challenges related to continuous bounded response variables.

(Guolo and Varin 2014) extend beta regression to time series analysis by incorporating a Gaussian copula to model serial dependence in the data. The study’s practical application involves analyzing influenza-like-illness incidence data from the Google Flu Trends project. Their methodology provides a framework for analyzing bounded time series data and offers insights into inference, prediction, and control in this context.

Moreover, the paper by (Patrícia L. Espinheira and Cribari-Neto 2008) introduces new residuals based on Fisher’s scoring iterative algorithm, addressing a common challenge in statistical analysis by offering a practical route to improved model assessment and bias correction. The authors demonstrate the superiority of these residuals in various scenarios, particularly in small sample sizes, making this a pivotal contribution to the methodology’s practical applicability.

Various papers throughout the years have sought to improve the basic beta regression techniques of (Ferrari and Cribari-Neto 2004). (Simas, Barreto-Souza, and Rocha 2010) delve into bias correction in the quest for improved estimators. Recognizing the potential for bias in parameter estimation, especially in small sample sizes, they provide closed-form expressions for the second-order biases of the maximum likelihood estimators and present several correction methods, including analytical approaches and bootstrap bias adjustment, enhancing the accuracy of model assessments and mitigating erroneous conclusions.

Likewise, (Schmid 2013) introduce one such enhancement, “boosted beta regression,” a novel extension of classical beta regression aimed at complex modeling scenarios with many predictor variables, multicollinearity, nonlinear relationships, and overdispersion. They use the gamboostLSS algorithm to estimate beta regression models efficiently while preserving the structure of beta regression. The method’s effectiveness is demonstrated through ecological data applications, highlighting its advantages in terms of model fit and prediction accuracy.

(Abonazel et al. 2022) took a different approach, focusing on improving the estimation of beta regression models when multicollinearity is present among regressors. They introduced the Dawoud–Kibria estimator as an alternative to the conventional maximum likelihood estimator in such situations, derived the properties of this new estimator, and assessed its performance through theoretical comparisons, Monte Carlo simulations, and a real-life application.

Finally, it is important to note the challenges of beta regression. (Douma and Weedon 2019) address one challenge of analyzing proportional data in ecology and evolution. Proportional data, common in these fields, pose unique statistical challenges due to their bounded nature and varying variances. The paper introduces beta and Dirichlet regression as effective tools for analyzing such data, allowing analysis on the original scale, reducing bias, and simplifying interpretation, and it provides practical guidance on model fitting and reporting.

(Couri et al. 2022) focuses on the computational challenges of estimating parameters in beta regression models. The authors evaluate ten algorithms for parameter estimation and find that four, including simulated annealing, perform well. The paper also examines the optim function in R and suggests its use in challenging cases. Simulated annealing is recommended, particularly when the optim function fails. Overall, the research offers insights into improving parameter estimation for beta regression models in various applications.

Furthermore, one such limitation of classical beta regression is its handling of zero or one inflated data. (Ospina and Ferrari 2012) introduce a versatile class of regression models, zero-or-one inflated beta regression, specifically tailored to address datasets containing continuous proportions with zeros or ones. The paper offers a flexible solution for modeling mixed continuous-discrete distributions, essential in real-world scenarios such as income contributions or work-related activities. By combining the beta distribution with a degenerate distribution at zero or one, the authors provide a robust framework for effective modeling, parameter estimation, diagnostics, and model selection.

Methods

Our objective is to investigate the relationship between the given predictors and the response variable, which in this case is the chance of admission. Because the chosen response is proportional in nature, linear regression is not appropriate: its predicted values can fall outside the interval (0, 1), whereas all of the observed proportions lie within those bounds. The suitable approach is beta regression, which we implement to predict the chance of admission from the applicant-level predictors. Conveniently, the response variable chance_of_admit is already recorded as a proportion between 0 and 1, so no additional transformation is needed before fitting the model.

With the proportional response in hand, the beta regression algorithm can be implemented. Before delving into the implementation, let’s briefly review the mathematics behind beta regression. Beta regression is built on a link function, most commonly the logit, which maps probabilities in the interval (0, 1) onto the whole real line: \[ \mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) \]
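As a minimal sketch, the logit and its inverse can be written directly in base R:

```r
# logit maps (0, 1) to the real line; its inverse maps back
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))  # equivalent to plogis(eta)

logit(0.5)               # 0: even odds sit at the center of the scale
inv_logit(logit(0.73))   # recovers 0.73
```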

Another key ingredient in predicting the proportional values is the beta density. The beta density is especially useful for modelling proportions because its support is the open interval (0, 1). When implementing beta regression, the response variable is assumed to follow a beta distribution, hence the name beta regression.

The beta density formula is \[ f(y;p,q) = \frac{\Gamma(p + q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1}(1-y)^{q-1}, \qquad 0 < y < 1, \] where \(p, q > 0\) and \(\Gamma(\cdot)\) is the gamma function. An alternative parameterization was proposed by (Ferrari and Cribari-Neto 2004) by setting \(\mu = p/(p+q)\) and \(\phi = p + q\):

\[ f(y;\mu,\phi)=\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1} \]

\[0<y<1\]

with \(0 < \mu < 1\) and \(\phi > 0\).

We denote \(y\sim B(\mu,\phi)\), where \(E(y)=\mu\) and \(\mathrm{Var}(y)=\mu(1-\mu)/(1+\phi)\). The parameter \(\phi\) is known as the precision parameter: for a fixed value of \(\mu\), the variance of the response variable \(y\) decreases as \(\phi\) increases.
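As a quick numerical sketch of this parameterization, assuming only base R (the helper name `dbeta_mu_phi` is ours, not part of any package), the density and its moments can be checked by simulation:

```r
# Mean-precision parameterization: shape1 = mu*phi, shape2 = (1 - mu)*phi
dbeta_mu_phi <- function(y, mu, phi) {
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

# Check E(y) = mu and Var(y) = mu(1 - mu)/(1 + phi) by simulation
set.seed(1)
mu <- 0.7; phi <- 20
y <- rbeta(1e5, mu * phi, (1 - mu) * phi)
mean(y)  # close to 0.7
var(y)   # close to 0.7 * 0.3 / 21 = 0.01
```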

Let \(y_1,\ldots,y_n\) be a random sample with \(y_i \sim B(\mu_i,\phi)\), \(i = 1,\ldots,n\). The beta regression model is defined as \[ g(\mu_i) = x^T_i\beta = \eta_i, \] where \(\beta=(\beta_1,\ldots,\beta_k)^T\) is a \(k\times 1\) vector of unknown regression parameters \((k<n)\), \(x_i=(x_{i1},\ldots,x_{ik})^T\) is the vector of \(k\) regressors (independent variables or covariates), and \(\eta_i\) is the linear predictor, i.e., \(\eta_i = \beta_1x_{i1}+\cdots+\beta_kx_{ik}\); usually \(x_{i1} = 1\) for all \(i\) so that the model has an intercept. Here \(g(\cdot): (0,1) \to \mathbb{R}\) is the link function.
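To make the link-function notation concrete, here is a minimal base-R sketch; the coefficient values are purely hypothetical, not fitted estimates:

```r
# With a logit link, mu_i = g^(-1)(x_i' beta); plogis() is the inverse logit
X <- cbind(1, c(300, 320, 340))   # intercept column plus one regressor
beta <- c(-9, 0.03)               # hypothetical coefficients for illustration
eta <- X %*% beta                 # linear predictor eta_i
mu_hat <- plogis(eta)             # fitted means, each in (0, 1)
```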

Analysis and Results

This section aims to identify key variables for constructing a beta regression model to predict admission chances. Initially, we analyze the dataset to determine the correlation between various predictors and the chance of admission. Following this, we use these insights to develop and evaluate several models, focusing on their predictive accuracy and effectiveness. This process is critical for establishing a reliable model that accurately forecasts admission probabilities based on the identified significant predictors.

Data and Visualization

Data: University Admission Data

Attribute Information

| Variable | Parameter | Range | Description |
|---|---|---|---|
| GRE Score | `gre_score` | 290–340 (340 scale) | Quantifies a candidate’s performance on the Graduate Record Examination, with a maximum score of 340 |
| TOEFL Score | `toefl_score` | 92–120 (120 scale) | Measures English language proficiency, scored out of a total of 120 points |
| University Rating | `university_rating` | 1 to 5, with 5 being the highest rating | Rates universities on a scale from 1 to 5, indicating their overall quality and reputation |
| Statement of Purpose (SOP) Strength | `sop` | 1 to 5, with 5 being the highest rating | Evaluates the strength and quality of a candidate’s SOP on a scale of 1 to 5 |
| Letter of Recommendation (LOR) Strength | `lor` | 1 to 5, with 5 being the highest rating | Evaluates the strength and quality of a candidate’s LOR on a scale of 1 to 5 |
| Undergraduate GPA | `cgpa` | 6.8–9.92 (10.0 scale) | Reflects a student’s academic performance in their undergraduate studies, scored on a 10-point scale |
| Research Experience | `research` | 0 or 1 | Indicates whether a candidate has research experience (1) or not (0) |
| Chance of Admit | `chance_of_admit` | 0.34–0.97 (0 to 1 scale) | Represents the likelihood of a student being admitted, expressed as a decimal between 0 and 1 |

Libraries

Load necessary packages for analysis and modeling.

Code
#Install required packages
#install.packages("janitor")
#install.packages("caTools")
#install.packages("caret")
library(caret)
library(reshape2)
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(betareg)
library(lmtest)
library(car)
library(rcompanion)
library(janitor)
library(here)
library(caTools)
library(pROC)

Loading Dataset

Code
admission <- read_csv("adm_data.csv")
#str(admission)

admission <- admission %>%
  clean_names()

head(admission,5)
# A tibble: 5 × 9
  serial_no gre_score toefl_score university_rating   sop   lor  cgpa research
      <dbl>     <dbl>       <dbl>             <dbl> <dbl> <dbl> <dbl>    <dbl>
1         1       337         118                 4   4.5   4.5  9.65        1
2         2       324         107                 4   4     4.5  8.87        1
3         3       316         104                 3   3     3.5  8           1
4         4       322         110                 3   3.5   2.5  8.67        1
5         5       314         103                 2   2     3    8.21        0
# ℹ 1 more variable: chance_of_admit <dbl>

Chance of Admit and TOEFL

Based on exploratory data analysis, higher TOEFL scores appear to be associated with a greater chance of admission; this chance is further augmented by higher university ratings, and research experience appears to be a strong factor in increasing both TOEFL scores and the likelihood of admission.

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit)) +
  geom_point(color = "#1f77b4", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(x = "TOEFL Score", y = "Chance of Admission", title = "Linear and Smooth Fit: Chance of Admission vs TOEFL Score") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "bottom")

The correlation coefficient of 0.791594 between TOEFL scores and the chance of admission suggests a strong positive relationship. This indicates that as TOEFL scores increase, the chance of admission tends to increase as well.
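The reported coefficient can be reproduced directly with `cor()`, assuming the cleaned `admission` tibble from the loading step above:

```r
# Pearson correlation between TOEFL score and chance of admission
cor(admission$toefl_score, admission$chance_of_admit)
```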

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_jitter(width = .2, size = I(3)) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) +
  labs(x = "TOEFL Score", y = "Chance of Admission", color = "University Rating", 
       title = "Chance of Admit by TOEFL Score per University Ranking") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) +
  geom_smooth(method = "lm", se = FALSE)

This trend indicates that applicants to higher-rated universities generally have a higher chance of admission. The standard deviation decreases as the university rating increases, suggesting more consistency in admission chances at higher-rated universities.

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit, color = as.factor(research))) +
  geom_jitter(width = .2) +
  scale_color_manual(values = c("#270181", "coral")) +
  labs(x = "TOEFL Score", y = "Chance of Admission", color = "Research Experience", 
       title = "Chance of Admit by TOEFL Score per Research Experience") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

Applicants with research experience (Research = 1) have a higher average TOEFL score (approx. 110) compared to those without research experience (Research = 0), who have an average TOEFL score of about 104.

The mean chance of admission for applicants with research experience is significantly higher (approx. 0.796) than for those without (approx. 0.638).

This data suggests that research experience is positively associated with both higher TOEFL scores and a greater likelihood of admission.
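The group averages quoted above can be reproduced with a short `dplyr` summary, assuming the cleaned `admission` data is loaded:

```r
# Mean TOEFL score and admission chance by research experience (0/1)
admission %>%
  group_by(research) %>%
  summarise(mean_toefl = mean(toefl_score),
            mean_chance = mean(chance_of_admit))
```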

Chance of Admit and GRE

These analyses suggest that higher GRE scores are strongly correlated with an increased chance of admission. The likelihood of admission also appears to be influenced by the university rating and is further enhanced by research experience.

Code
library(ggplot2)

ggplot(admission, aes(x = gre_score, y = chance_of_admit)) +
  geom_point(color = "#1f77b4", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(x = "GRE Score", y = "Chance of Admission", 
       title = "Chance of Admit vs GRE Score") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "bottom")

A correlation coefficient of 0.8026105 indicates a strong positive relationship between GRE scores and the chance of admission. This suggests that higher GRE scores are generally associated with a higher likelihood of being admitted.

Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_point() + 
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) +
  labs(x = "GRE Score", y = "Chance of Admission", color = "University Rating", 
       title = "Chance of Admit by GRE Score per University Ranking") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

This trend suggests that applicants to higher-rated universities have a higher chance of admission, with the chance of admission being most favorable at the highest-rated universities.

Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit, color = as.factor(research))) +
  geom_point() + 
  scale_color_manual(values = c("#270181", "coral")) +
  labs(x = "GRE Score", y = "Chance of Admission", color = "Research Experience", 
       title = "Chance of Admit by GRE Score per Research Experience") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

Applicants with research experience (Research = 1) have a higher average GRE score (about 323) compared to those without research experience (Research = 0), who have an average GRE score of approximately 309. Similarly, the mean chance of admission is significantly higher for applicants with research experience (approx. 0.796) than for those without it (approx. 0.638). This indicates that research experience is positively associated with both higher GRE scores and a greater likelihood of admission.

Chance of Admit and CGPA

Code
#Scatterplot of CGPA vs. chance of admit, colored by university rating
ggplot(admission, aes(x = cgpa, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_jitter(width = .2)+
  scale_color_manual(values = c("#270181", "coral","yellow","blue","green")) +
  ggtitle("Chance of Admit by GPA per University Rating")+
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))

Chance of Admission Correlation Heatmap

Code
library(ggplot2)
library(reshape2)
library(viridis)

# Calculate the correlation matrix
data <- cor(admission[sapply(admission, is.numeric)], use = "complete.obs")

# Reshape data for ggplot
data1 <- melt(data)

# Create the heatmap
ggplot(data1, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_viridis(option = "C", direction = -1) +  # Using viridis color scale
  labs(title = "Admission Correlation Heatmap", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),
        axis.text.y = element_text(size = 10),
        plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "right") +
  geom_text(aes(label = round(value, digits = 2)), color = "white", size = 3)

The heatmap shows that GRE scores, TOEFL scores, and CGPA are strongly and positively correlated with the chance of admission. Research experience also positively influences admission chances, albeit to a lesser extent than academic scores.

Model Building and Evaluation

Given the exploratory data analysis, we will now construct a predictive model for the response variable chance_of_admit.

Code
# Scaling of Data
#admission <- scale(admission)
#admission <- as.data.frame(admission)
Code
# Splitting the Data

#make this example reproducible
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(admission), replace=TRUE, prob=c(0.7,0.3))
train  <- admission[sample, ]
test   <- admission[!sample, ]

#view dimensions of training set
#dim(train)

#view dimensions of test set
#dim(test)

X_train <- train

X_test = subset(test,select = -c(chance_of_admit))
keeps <- c("chance_of_admit")
y_test = test[keeps]

Fitting Model 1

Model 1 = chance_of_admit ~ gre_score + toefl_score + university_rating + sop + lor + cgpa + research + sop*lor

Code
gy_logit <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + sop + lor + cgpa + research + sop*lor, data = train)

model_summary <- summary(gy_logit)
# Extracting coefficients for the mean model
coefficients_mean <- model_summary$coefficients$mean
coeff_mean_df <- data.frame(round(coefficients_mean, 4))
colnames(coeff_mean_df) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")

# Extract log-likelihood and pseudo R-squared
log_likelihood <- round(model_summary$loglik, 2)
pseudo_r_squared <- round(model_summary$pseudo.r.squared, 4)

# Print the formatted results
print("Coefficients (Mean Model):")
[1] "Coefficients (Mean Model):"
Code
print(coeff_mean_df)
                  Estimate Std. Error  z value Pr(>|z|)
(Intercept)        -9.1585     0.7793 -11.7528   0.0000
gre_score           0.0087     0.0034   2.5672   0.0103
toefl_score         0.0209     0.0064   3.2752   0.0011
university_rating   0.0299     0.0299   1.0003   0.3172
sop                -0.2781     0.0733  -3.7949   0.0001
lor                -0.1030     0.0759  -1.3575   0.1746
cgpa                0.6348     0.0685   9.2616   0.0000
research            0.1516     0.0457   3.3205   0.0009
sop:lor             0.0716     0.0214   3.3503   0.0008
Code
cat("\nLog-Likelihood:", log_likelihood, "\nPseudo R-squared:", pseudo_r_squared)

Log-Likelihood: 414.39 
Pseudo R-squared: 0.8379
Code
library(leaps)
models <- regsubsets(chance_of_admit~. -serial_no, data = admission, nvmax = 9,
                     method = "seqrep")

library(broom)

model_summaries <- summary(models)

# Number of models
num_models <- nrow(model_summaries$which)

# Initialize an empty data frame for the summary
summary_table <- data.frame(Size = integer(), 
                            Variables = character(), 
                            R.Squared = numeric(), 
                            Adj.R.Squared = numeric(), 
                            BIC = numeric(), 
                            stringsAsFactors = FALSE)

# Fill the summary table
for (i in 1:num_models) {
  included_vars <- names(which(model_summaries$which[i, ]))
  vars_str <- paste(included_vars, collapse = ", ")
  summary_table <- rbind(summary_table, 
                         data.frame(Size = i, 
                                    Variables = vars_str, 
                                    R.Squared = model_summaries$rsq[i], 
                                    Adj.R.Squared = model_summaries$adjr2[i], 
                                    BIC = model_summaries$bic[i]))
}

# Print the table
print(summary_table)
  Size
1    1
2    2
3    3
4    4
5    5
6    6
7    7
                                                                         Variables
1                                                                (Intercept), cgpa
2                                              (Intercept), gre_score, toefl_score
3                                                (Intercept), gre_score, lor, cgpa
4                                      (Intercept), gre_score, lor, cgpa, research
5                 (Intercept), gre_score, toefl_score, university_rating, sop, lor
6           (Intercept), gre_score, toefl_score, university_rating, sop, lor, cgpa
7 (Intercept), gre_score, toefl_score, university_rating, sop, lor, cgpa, research
  R.Squared Adj.R.Squared       BIC
1 0.7626339     0.7620375 -563.2776
2 0.6925050     0.6909559 -453.7442
3 0.7941207     0.7925610 -608.2202
4 0.7986651     0.7966263 -611.1569
5 0.7504324     0.7472653 -519.2615
6 0.7987119     0.7956388 -599.2669
7 0.8034714     0.7999619 -602.8472

We use the `regsubsets` function to perform subset selection, aiming to identify the most predictive variables for the `chance_of_admit` outcome. This approach systematically evaluates combinations of up to seven predictors, such as `gre_score`, `toefl_score`, and `cgpa`, to determine their impact on the chance of admission. The goal is to find the set of variables that best predicts the admission outcome at each level of model complexity.

The output indicates which variables are included in the best models at each complexity level, revealing that `cgpa`, `gre_score`, and `lor` are significant predictors in simpler models. As more variables are added, the model becomes more complex, potentially improving accuracy but also increasing the risk of overfitting. This analysis helps in understanding the relative importance of different academic and profile factors in determining admission chances.

Code
# Set seed for reproducibility
set.seed(123)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(chance_of_admit ~. -serial_no, data = admission,
                    method = "leapSeq", 
                    tuneGrid = data.frame(nvmax = 1:7),
                    trControl = train.control
                    )
step.model$results
  nvmax       RMSE  Rsquared        MAE      RMSESD RsquaredSD       MAESD
1     1 0.06889355 0.7635275 0.05110306 0.009641911 0.06007377 0.007013934
2     2 0.07889825 0.6984116 0.06031014 0.013609817 0.06846210 0.010474096
3     3 0.06452847 0.7965067 0.04693197 0.009982503 0.05017603 0.007105953
4     4 0.06692766 0.7838953 0.04978752 0.015166549 0.06632972 0.012106201
5     5 0.06929365 0.7693051 0.05126744 0.008662209 0.05018380 0.005967808
6     6 0.06422188 0.7988891 0.04638031 0.010252374 0.05087166 0.007242793
7     7 0.06347942 0.8033827 0.04585758 0.010594007 0.05415636 0.007976196

In this analysis, 10-fold cross-validation is employed to assess the model’s predictive performance by dividing the data into ten subsets and iteratively training the model on nine subsets while testing on the remaining one. The procedure is repeated for models with varying numbers of predictors. Key performance metrics such as RMSE (root mean squared error), R-squared, and MAE (mean absolute error) are calculated for each model, showing how the inclusion of additional variables affects accuracy. The results show variations in performance across different numbers of predictors, with certain models achieving lower RMSE and higher R-squared values.
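To see which subset the cross-validation ultimately selects, the tuned `step.model` object fitted above can be inspected; this is a sketch using standard `caret` and `leaps` accessors:

```r
step.model$bestTune   # nvmax with the lowest cross-validated RMSE
# Coefficients of the best subset of that size
coef(step.model$finalModel, id = step.model$bestTune$nvmax)
```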

Code
#Predictions on the test set
#predictions <- predict(gy_logit,test[-9],type = "response")
#lrtest(gy_logit)
#AIC(gy_logit)
#vif(gy_logit) 
#Anova(gy_logit3)
Code
gy_logit <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa + research, data = train)


gy_logit1 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa , data = train)

gy_logit2 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + research, data = train)

gy_logit3 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + cgpa + research, data = train)

gy_logit4 <- betareg(chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research, data = train)

gy_logit5 <- betareg(chance_of_admit ~ gre_score + university_rating + lor + cgpa + research, data = train)

gy_logit6 <- betareg(chance_of_admit ~ toefl_score + university_rating + lor + cgpa + research, data = train)

gy_logit7 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa + research, data = train)

# Creating a data frame with AIC and pseudo R-squared values for each model
model_comparison <- data.frame(
  Model = c("gy_logit", "gy_logit1", "gy_logit2", "gy_logit3", "gy_logit4", "gy_logit5", "gy_logit6", "gy_logit7"),
  AIC = c(AIC(gy_logit), AIC(gy_logit1), AIC(gy_logit2), AIC(gy_logit3), AIC(gy_logit4), AIC(gy_logit5), AIC(gy_logit6), AIC(gy_logit7)),
  Pseudo_R_Squared = c(gy_logit$pseudo.r.squared, gy_logit1$pseudo.r.squared, gy_logit2$pseudo.r.squared, gy_logit3$pseudo.r.squared, gy_logit4$pseudo.r.squared, gy_logit5$pseudo.r.squared, gy_logit6$pseudo.r.squared, gy_logit7$pseudo.r.squared)
)


print(model_comparison)
      Model       AIC Pseudo_R_Squared
1  gy_logit -798.5518        0.8262585
2 gy_logit1 -791.6053        0.8198808
3 gy_logit2 -727.0029        0.7706337
4 gy_logit3 -791.7938        0.8229096
5 gy_logit4 -799.3962        0.8250244
6 gy_logit5 -792.9895        0.8202429
7 gy_logit6 -794.4413        0.8228542
8 gy_logit7 -798.5518        0.8262585

We employ beta regression models to predict the chance of admission, utilizing various combinations of predictors such as GRE scores, TOEFL scores, university ratings, letters of recommendation, CGPA, and research experience. The primary objective is to determine the optimal set of predictors that can accurately forecast admission chances while maintaining model simplicity.

The models, labeled `gy_logit` to `gy_logit7`, were evaluated using the Akaike Information Criterion (AIC) and the pseudo R-squared metric. A lower AIC value indicates a more efficient model in terms of the trade-off between goodness of fit and complexity, while a higher pseudo R-squared value suggests a better model fit.

The outcomes reveal that `gy_logit4` and `gy_logit7` yield the lowest AIC values, indicating they are potentially the most efficient models among those tested. Both these models, along with `gy_logit`, also exhibit the highest pseudo R-squared values, suggesting strong model fits.
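Since `gy_logit1` is nested in `gy_logit` (it drops only `research`), a likelihood-ratio test from the already-loaded `lmtest` package offers one way, as a sketch, to check whether `research` adds explanatory value beyond the AIC comparison:

```r
# Likelihood-ratio test: model without research vs. model with research
lrtest(gy_logit1, gy_logit)
```

A small p-value here would indicate that the richer model fits significantly better.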

Examining the Performance of Various Models

gy_logit = chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa + research

AIC: -798.5518, Pseudo R-Squared: 0.8262585

Code
predictions <- predict(gy_logit,test[-9],type = "response")
df<- data.frame(predictions,y_test)
ggplot(df, aes(x=chance_of_admit, y = predictions )) +
  geom_point(color = "red") +
  geom_abline(intercept=0, slope=1, color = "yellow", lwd = 1) +
  labs(y = 'Predicted Values', x = 'Actual Values', title = 'Predicted vs. Actual Values') +
  theme_bw()

Code
predictions <- predict(gy_logit,test[-9],type = "response")
df<- data.frame(predictions,y_test)
ggplot(df, aes(x=1:nrow(df))) + 
  geom_line(aes(y = predictions), color = "darkred") + 
  geom_line(aes(y = test$chance_of_admit), color="steelblue")+
  labs(y = "Chance of Admission", x = "Row index", title = 'Predicted Values (red) vs. Actual Values (blue)') +
  theme_bw()

Other models

gy_logit4 = chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research

AIC: -799.3962, Pseudo R-Squared: 0.8250244

Code
predictions <- predict(gy_logit4,test[-9],type = "response")
df<- data.frame(predictions,y_test)
ggplot(df, aes(x=chance_of_admit, y = predictions )) +
  geom_point(color = "blue") +
  geom_abline(intercept=0, slope=1, color = "green", lwd = 1) +
  labs(y = 'Predicted Values', x = 'Actual Values', title = 'Predicted vs. Actual Values') +
  theme_bw()
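Beyond the scatterplot, out-of-sample fit for `gy_logit4` can be quantified with a test-set RMSE; this is a sketch using the train/test split created above:

```r
# Root mean squared error of gy_logit4 predictions on the held-out test set
predictions <- predict(gy_logit4, test, type = "response")
rmse <- sqrt(mean((test$chance_of_admit - predictions)^2))
rmse
```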

gy_logit3 = chance_of_admit ~ gre_score + toefl_score + university_rating + cgpa + research

AIC: -791.7938, Pseudo R-Squared: 0.8229096

Code
plot(gy_logit3,main = "Model 3")

Code
plot(fitted(gy_logit3),
     residuals(gy_logit3))

Conclusion

Beta regression proved well suited to modeling the chance of admission, a response bounded between 0 and 1. Exploratory analysis showed that CGPA, GRE scores, and TOEFL scores are the strongest correlates of admission chances, with research experience providing an additional boost. Among the candidate models, gy_logit4 (gre_score, toefl_score, lor, cgpa, research) achieved the lowest AIC (-799.3962) with a pseudo R-squared of about 0.825, and its predictions track the observed admission chances closely on the held-out test set. Future work could explore the zero-or-one inflated and bias-corrected extensions discussed in the literature review.

References

Abonazel, Mohamed R., Issam Dawoud, Fuad A. Awwad, and Adewale F. Lukman. 2022. “Dawoud–Kibria Estimator for Beta Regression Model: Simulation and Application.” Frontiers in Applied Mathematics and Statistics 8. https://doi.org/10.3389/fams.2022.775068.
Couri, Lucas, Raydonal Ospina, Geiza da Silva, Víctor Leiva, and Jorge Figueroa-Zúñiga. 2022. “A Study on Computational Algorithms in the Estimation of Parameters for a Class of Beta Regression Models.” Mathematics 10 (3): 299. https://doi.org/10.3390/math10030299.
Cribari-Neto, Francisco, and Achim Zeileis. 2010. “Beta Regression in R.” Journal of Statistical Software 34 (2): 1–24. https://doi.org/10.18637/jss.v034.i02.
Douma, Jacob C., and James T. Weedon. 2019. “Analysing Continuous Proportions in Ecology and Evolution: A Practical Introduction to Beta and Dirichlet Regression.” Methods in Ecology and Evolution 10 (9): 1412–30. https://doi.org/10.1111/2041-210X.13234.
Espinheira, Patrícia L., Silvia L. P. Ferrari, and Francisco Cribari-Neto. 2008. “On Beta Regression Residuals.” Journal of Applied Statistics 35 (4): 407–19. https://doi.org/10.1080/02664760701834931.
Ferrari, Silvia, and Francisco Cribari-Neto. 2004. “Beta Regression for Modelling Rates and Proportions.” Journal of Applied Statistics 31 (7): 799–815. https://EconPapers.repec.org/RePEc:taf:japsta:v:31:y:2004:i:7:p:799-815.
Guolo, Annamaria, and Cristiano Varin. 2014. “Beta Regression for Time Series Analysis of Bounded Data, with Application to Canada Google Flu Trends.” The Annals of Applied Statistics 8 (1): 74–88. https://doi.org/10.1214/13-AOAS684.
Ospina, Raydonal, and Silvia L. P. Ferrari. 2012. “A General Class of Zero-or-One Inflated Beta Regression Models.” Computational Statistics & Data Analysis 56 (6): 1609–23. https://doi.org/10.1016/j.csda.2011.10.005.
Schmid, Matthias, Florian Wickler, Kelly O. Maloney, Richard Mitchell, Nora Fenske, and Andreas Mayr. 2013. “Boosted Beta Regression.” PLOS ONE 8 (4): e61623. https://doi.org/10.1371/journal.pone.0061623.
Simas, Alexandre B., Wagner Barreto-Souza, and Andréa V. Rocha. 2010. “Improved Estimators for a General Class of Beta Regression Models.” Computational Statistics & Data Analysis 54 (2): 348–66. https://ideas.repec.org/a/eee/csdana/v54y2010i2p348-366.html.